Exploring Variable Interactions




Table of Contents




Loading Data

We'll start by loading the abalone data. The strategy for exploring variable interactions is to try combinations of multiple variables and try to identify why even higher-order models were doing such a poor job of matching the data.

In [1]:
%matplotlib notebook

# numbers
import numpy as np
import pandas as pd

# stats
import statsmodels.api as sm
import scipy.stats as stats

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# utils
import os, re
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/__init__.py:872: UserWarning: axes.color_cycle is deprecated and replaced with axes.prop_cycle; please use the latter.
  warnings.warn(self.msg_depr % (key, alt_key))
In [2]:
# Load data, filter outliers (zero height/length)

def abalone_load(data_file, infant=False):
    # x data labels
    xnlabs = ['Sex']
    xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
    xlabs = xnlabs + xqlabs

    # y data labels
    ylabs = ['Rings']

    # data
    df = pd.read_csv(data_file, header=None, sep=' ', names=xlabs+ylabs)
    
    if(infant):
        new_df = df[ df['Sex']=='I' ]
    else:
        new_df = df[ df['Sex']<>'I' ]
    
    return new_df

def adult_abalone_load(data_file):
    return abalone_load(data_file,False)

def abalone_removeoutliers(df,verbose=False):
    len1 = len(df)
    df = df[ df['Height'] < 0.30 ]
    df = df[ df['Height'] > 0.001 ]
    df = df[ df['Length'] > 0.001 ]
    df = df[ df['Diameter'] > 0.001 ]
    len2 = len(df)
    if(verbose):
        print "Removed",(len1-len2),"outliers"
    return df
In [3]:
# x data labels
xnlabs = ['Sex']
xqlabs = ['Length','Diameter','Height','Whole weight','Shucked weight','Viscera weight','Shell weight']
xlabs = xnlabs + xqlabs

# y data labels
ylabs = ['Rings']
In [4]:
adt_df = abalone_removeoutliers(adult_abalone_load('abalone/Dataset.data'),True)
Removed 2 outliers
In [5]:
print adt_df.columns
Index([u'Sex', u'Length', u'Diameter', u'Height', u'Whole weight',
       u'Shucked weight', u'Viscera weight', u'Shell weight', u'Rings'],
      dtype='object')

Calculating Derived Quantities

Next, we'll calculate some derived quantities - mostly normalizing and multiplying variables, and taking their ratios.

In [6]:
adt_df['Length (Normalized)']    = adt_df['Length']/adt_df['Length'].mean()
adt_df['Diameter (Normalized)']  = adt_df['Diameter']/adt_df['Diameter'].mean()
adt_df['Height (Normalized)']    = adt_df['Height']/adt_df['Height'].mean()

adt_df['Length (Max Normalized)']    = adt_df['Length']/adt_df['Length'].max()
adt_df['Diameter (Max Normalized)']  = adt_df['Diameter']/adt_df['Diameter'].max()
adt_df['Height (Max Normalized)']    = adt_df['Height']/adt_df['Height'].max()

adt_df['Volume']                 = adt_df['Length']*adt_df['Diameter']*adt_df['Height']


adt_df['Volume (Normalized)']    = adt_df['Length (Normalized)']*adt_df['Diameter (Normalized)']*adt_df['Height (Normalized)']
adt_df['Dimension (Normalized)'] = adt_df['Volume (Normalized)'].apply( lambda x : pow(x,1.0/3.0) )

adt_df['Volume (Max Normalized)']    = adt_df['Length (Max Normalized)']*adt_df['Diameter (Max Normalized)']*adt_df['Height (Max Normalized)']
adt_df['Volume (VMax Normalized)'] = adt_df['Volume']/adt_df['Volume'].max()




adt_df['Volume (Log Normalized)'] = adt_df['Volume (Normalized)'].apply( lambda x : np.log(x) )
adt_df['Volume (Log Max Normalized)'] = adt_df['Volume (Max Normalized)'].apply( lambda x : np.log(x) )
adt_df['Volume (Log)']            = adt_df['Volume'].apply( lambda x : np.log(x) )



adt_df['Area']                    = 3.14159*adt_df['Diameter'].apply(lambda x : x*x)
adt_df['Inverse Volume']          = adt_df['Volume'].apply( lambda x : x / (1.00000 - x) )

adt_df['Viscera-Shell Weight Ratio']            = adt_df['Viscera weight']/adt_df['Shell weight']
adt_df['Viscera-Shucked Weight Ratio']            = adt_df['Viscera weight']/adt_df['Shucked weight']
adt_df['Viscera-Whole Weight Ratio']            = adt_df['Viscera weight']/adt_df['Whole weight']

adt_df['Whole weight (Normalized)']   = adt_df['Whole weight']/adt_df['Whole weight'].max()
adt_df['Shell weight (Normalized)']   = adt_df['Shell weight']/adt_df['Shell weight'].max()
adt_df['Shucked weight (Normalized)'] = adt_df['Shucked weight']/adt_df['Shucked weight'].max()
adt_df['Viscera weight (Normalized)'] = adt_df['Viscera weight']/adt_df['Viscera weight'].max()

adt_df['Rings (Log)']             = adt_df['Rings'].apply( lambda x : np.log(x) )
adt_df['Rings (Normalized)']      = adt_df['Rings'].apply( lambda x : float(x)  )/adt_df['Rings'].max()
In [7]:
fig = plt.figure()
sns.distplot( adt_df['Volume (Log Normalized)'], label="Volume (Log Norm)")
sns.distplot( adt_df['Volume (Normalized)'], label="Volume (Normalized)")
plt.legend()
/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.py:564: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  n = np.zeros(bins, ntype)
/usr/local/lib/python2.7/site-packages/numpy/lib/function_base.py:611: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
  n += np.bincount(indices, weights=tmp_w, minlength=bins).astype(ntype)
Out[7]:
<matplotlib.legend.Legend at 0x109014450>
In [8]:
fig = plt.figure()

labels = ['Viscera-Shell Weight Ratio','Viscera-Shucked Weight Ratio','Viscera-Whole Weight Ratio']
for lab in labels:
    sns.distplot( adt_df[lab], label = lab)

plt.legend()
plt.show()
In [9]:
fig = plt.figure()

plt.plot( adt_df['Length (Normalized)'],   adt_df['Rings'], 'r*' )
plt.plot( adt_df['Diameter (Normalized)'], adt_df['Rings'], 'b*' )
plt.plot( adt_df['Height (Normalized)'],   adt_df['Rings'], 'g*' )

plt.show()

Scatterplots

Let's look at some color-coded scatterplots of quantities to see if any derived quantities map nicely to the number of rings.

In [10]:
plt.figure()

#colors=adt_df['Rings'].apply(lambda x : float(x)).values
colors=adt_df['Rings (Log)']
colors -= colors.min()
colors *= (1.0/colors.max())
cm = plt.cm.get_cmap('RdYlBu')
cm = plt.cm.ocean
cm = plt.cm.copper
cm = plt.cm.magma
cm = plt.cm.RdBu
cm = plt.cm.Purples
cm = plt.cm.OrRd

cm = plt.cm.jet
plot_color = cm(colors)

#plt.plot( adt_df['Volume (Log Normalized)'], adt_df['Rings'], 'b*' )
plt.scatter( adt_df['Volume (Log Normalized)'], adt_df['Viscera-Shell Weight Ratio'], marker='*', alpha=0.3, c=plot_color)
plt.xlabel("Volume (Log Normalized)")
plt.ylabel("Viscera-Shell Weight Ratio")
plt.show()
In [11]:
#colors=adt_df['Rings'].apply(lambda x : float(x)).values
colors=adt_df['Rings (Log)']
colors -= colors.min()
colors *= (1.0/colors.max())
cm = plt.cm.get_cmap('RdYlBu')
cm = plt.cm.jet

plot_color = cm(colors)


fig = plt.figure(figsize=(14,8))
ax1,ax2,ax3 = fig.add_subplot(231), fig.add_subplot(232), fig.add_subplot(233)
ax4,ax5,ax6 = fig.add_subplot(234), fig.add_subplot(235), fig.add_subplot(236)


sct1 = ax1.scatter(adt_df['Volume (Log Normalized)'], adt_df['Viscera-Shell Weight Ratio'] , marker='*', c=plot_color, alpha=0.5)
ax1.set_xlabel('Volume (Log Normalized)')
ax1.set_ylabel('Viscera-Shell Weight Ratio')

sct2 = ax2.scatter(adt_df['Volume (Log Normalized)'], adt_df['Viscera-Shucked Weight Ratio'] , marker='*', c=plot_color, alpha=0.5)
ax2.set_xlabel('Volume (Log Normalized)')
ax2.set_ylabel('Viscera-Shucked Weight Ratio')

sct3 = ax3.scatter(adt_df['Volume (Log Normalized)'], adt_df['Viscera-Whole Weight Ratio'] , marker='*', c=plot_color, alpha=0.5)
ax3.set_xlabel('Volume (Log Normalized)')
ax3.set_ylabel('Viscera-Whole Weight Ratio')


sct4 = ax4.scatter(adt_df['Volume (Normalized)'], adt_df['Viscera-Shell Weight Ratio'] , marker='*', c=plot_color, alpha=0.5)
ax4.set_xlabel('Volume (Normalized)')
ax4.set_ylabel('Viscera-Shell Weight Ratio')

sct5 = ax5.scatter(adt_df['Volume (Normalized)'], adt_df['Viscera-Shucked Weight Ratio'] , marker='*', c=plot_color, alpha=0.5)
ax5.set_xlabel('Volume (Normalized)')
ax5.set_ylabel('Viscera-Shucked Weight Ratio')

sct6 = ax6.scatter(adt_df['Volume (Normalized)'], adt_df['Viscera-Whole Weight Ratio'] , marker='*', c=plot_color, alpha=0.5)
ax6.set_xlabel('Volume (Normalized)')
ax6.set_ylabel('Viscera-Whole Weight Ratio')


plt.show()
In [12]:
colors=adt_df['Rings'].apply(lambda x : float(x)).values

#colors=adt_df['Rings (Log)'].values

colors -= colors.min()
colors *= (1.0/colors.max())

cm = plt.cm.jet
plot_color = cm(colors)

fig = plt.figure(figsize=(14,10))
ax1,ax2,ax3 = fig.add_subplot(231), fig.add_subplot(232), fig.add_subplot(233)
ax4,ax5,ax6 = fig.add_subplot(234), fig.add_subplot(235), fig.add_subplot(236)


sc1 = ax1.scatter(adt_df['Volume (Log Normalized)'], adt_df['Whole weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax1.set_xlabel('Volume (Log Normalized)')
ax1.set_ylabel('Whole weight (Normalized)')

sc2 = ax2.scatter(adt_df['Volume (Normalized)'], adt_df['Whole weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax2.set_xlabel('Volume (Normalized)')
ax2.set_ylabel('Whole weight (Normalized)')

sc3 = ax3.scatter(adt_df['Volume (Log Max Normalized)'], adt_df['Whole weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax3.set_xlabel('Volume (Log Max Normalized)')
ax3.set_ylabel('Whole weight (Normalized)')



sc4 = ax4.scatter(adt_df['Volume (Log Normalized)'], adt_df['Shucked weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax4.set_xlabel('Volume (Log Normalized)')
ax4.set_ylabel('Shucked weight (Normalized)')

sc5 = ax5.scatter(adt_df['Volume (Normalized)'], adt_df['Shucked weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax5.set_xlabel('Volume (Normalized)')
ax5.set_ylabel('Shucked weight (Normalized)')

sc6 = ax6.scatter(adt_df['Volume (Log Max Normalized)'], adt_df['Shucked weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax6.set_xlabel('Volume (Log Max Normalized)')
ax6.set_ylabel('Shucked weight (Normalized)')




plt.show()
In [13]:
colors=adt_df['Rings'].apply(lambda x : float(x)).values
#colors=adt_df['Rings (Log)'].values

colors -= colors.min()
colors *= (1.0/colors.max())

cm = plt.cm.jet
plot_color = cm(colors)

fig = plt.figure(figsize=(14,10))
ax1,ax2,ax3 = fig.add_subplot(231), fig.add_subplot(232), fig.add_subplot(233)
ax4,ax5,ax6 = fig.add_subplot(234), fig.add_subplot(235), fig.add_subplot(236)


sc1 = ax1.scatter(adt_df['Volume (Log Normalized)'], adt_df['Shell weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax1.set_xlabel('Volume (Log Normalized)')
ax1.set_ylabel('Shell weight (Normalized)')

sc2 = ax2.scatter(adt_df['Volume (Normalized)'], adt_df['Shell weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax2.set_xlabel('Volume (Normalized)')
ax2.set_ylabel('Shell weight (Normalized)')

sc3 = ax3.scatter(adt_df['Volume (Log Max Normalized)'], adt_df['Shell weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax3.set_xlabel('Volume (Log Max Normalized)')
ax3.set_ylabel('Shell weight (Normalized)')



sc4 = ax4.scatter(adt_df['Volume (Log Normalized)'], adt_df['Viscera weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax4.set_xlabel('Volume (Log Normalized)')
ax4.set_ylabel('Viscera weight (Normalized)')

sc5 = ax5.scatter(adt_df['Volume (Normalized)'], adt_df['Viscera weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax5.set_xlabel('Volume (Normalized)')
ax5.set_ylabel('Viscera weight (Normalized)')

sc6 = ax6.scatter(adt_df['Volume (Log Max Normalized)'], adt_df['Viscera weight (Normalized)'] , marker='*', c=plot_color, alpha=0.3)
ax6.set_xlabel('Volume (Log Max Normalized)')
ax6.set_ylabel('Viscera weight (Normalized)')

plt.show()

3D Scatterplots

We can make relationships between variables a little more clear by popping out the scatterplots into a 3D surface. This makes the relationships between input variables and responses much more clear.

In [14]:
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
In [15]:
fig = plt.figure(figsize=(10,4))

colors=adt_df['Rings (Log)'].values
colors -= colors.min()
colors *= (1.0/colors.max())
cm = plt.cm.get_cmap('RdYlBu')
cm = plt.cm.jet
plot_color = cm(colors)


ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')


ax1.scatter(adt_df['Volume (Normalized)'], adt_df['Whole weight (Normalized)'] , adt_df['Rings (Log)'], c=plot_color)
ax2.scatter(adt_df['Volume (Log Normalized)'], adt_df['Whole weight (Normalized)'] , adt_df['Rings (Log)'], c=plot_color)

ax1.set_xlabel('Volume (Normalized)')
ax1.set_ylabel('Whole weight (Normalized)')
ax1.set_zlabel('Rings (Log)')

ax2.set_xlabel('Volume (Log Normalized)')
ax2.set_ylabel('Whole weight (Normalized)')
ax2.set_zlabel('Rings (Log)')

plt.show()
In [16]:
fig = plt.figure(figsize=(10,4))

colors=adt_df['Rings'].apply(lambda x : float(x) ).values
colors -= colors.min()
colors *= (1.0/colors.max())
cm = plt.cm.get_cmap('RdYlBu')
cm = plt.cm.jet
plot_color = cm(colors)


ax1 = fig.add_subplot(121, projection='3d')
ax2 = fig.add_subplot(122, projection='3d')


ax1.scatter(adt_df['Volume (Normalized)'], adt_df['Whole weight (Normalized)'] , adt_df['Rings'], c=plot_color)
ax2.scatter(adt_df['Volume (Log Normalized)'], adt_df['Whole weight (Normalized)'] , adt_df['Rings'], c=plot_color)

ax1.set_xlabel('Volume (Normalized)')
ax1.set_ylabel('Whole weight (Normalized)')
ax1.set_zlabel('Rings')

ax2.set_xlabel('Volume (Log Normalized)')
ax2.set_ylabel('Whole weight (Normalized)')
ax2.set_zlabel('Rings')

plt.show()

The above figure is an important one to understanding our system. The plot shows two measured variables - a multiplicative combination of the measured dimensions of the abalone, and the whole weight of the abalone. These map to the age of the abalone, indicated by the color and height of the points, along a three-dimensional surface. This indicates a clear relationship between the size of the abalone and its weight.

What's not as consistent is the growth rate of abalones - meaning, some abalones move along the volume-weight curve much faster than others. The abalone starts at the left end, and "ends" at the right end (maxing out at some largest possible abalone size). For a given location along that curve, there's a probability distribution for the number of rings an abalone will have. Early on in the curve, an abalone is more likely to have a smaller number of rings, but there are several abalones with many rings that are small. Very far along the curve, the distribution of abalone ring numbers still has a big spread, with lots of large young and large old abalones.

It would be helpful to have a third variable that would "smear out" this surface a bit more, something that would more clearly divide older and younger abalones. But here's the real problem:

  • Three variables relate to the abalone's physical dimensions; these give us a "volume" dimension (units of length cubed)
  • Four variables relate to the abalone's mass; these give us a "mass" dimension (units of mass)

However, the age of the abalones is a time variable - as is the "rate" at which the abalones advance along the volume-weight growth curve. We need something that correlates to that.

We can get the magnitude of the three-component vector, weight, volume, age:

In [17]:
adt_df['Growth Magnitude'] = ( adt_df['Rings (Normalized)'].apply(lambda x:x*x)  \
                        + adt_df['Volume (Normalized)'].apply(lambda x:x*x) \
                        + adt_df['Whole weight (Normalized)'].apply(lambda x:x*x) )
adt_df['Growth Magnitude'] = adt_df['Growth Magnitude'].apply(lambda x : np.sqrt(x))
In [18]:
fig = plt.figure()
sns.distplot(adt_df['Growth Magnitude'])
plt.show()
In [20]:
adt_df['Growth Rate'] = ( adt_df['Volume (Normalized)'].apply(lambda x:x*x) \
                        + adt_df['Whole weight (Normalized)'].apply(lambda x:x*x) )
adt_df['Growth Rate'] = adt_df['Growth Rate'].apply(lambda x : np.sqrt(x))
adt_df['Growth Rate'] = adt_df['Growth Rate']/adt_df['Rings (Normalized)']
In [21]:
fig = plt.figure()
sns.distplot(adt_df['Growth Rate'])
plt.show()

The growth rates have a bimodal distribution - a promising result that indicates we may be able to find a variable that will separate abalones into the slow-growers and the fast-growers.

Our path forward, then, is to start to cluster. We are going to look for similarities between abalones with similar "velocities" on the volume-weight curve, and try and find ways to identify said clusters.

Start by looking at the sex of the abalones.

In [22]:
adt_df['SexNum'] = adt_df['Sex'].eq('M').map(lambda x : float(x))

fig = plt.figure()
sns.distplot(adt_df['Growth Rate'][adt_df['SexNum']>0.9],label='male')
sns.distplot(adt_df['Growth Rate'][adt_df['SexNum']<0.1],label='female')
plt.legend()
plt.show()

No luck - whatever is making abalones grow at two (generally different) rates is not gender.

In [23]:
colors = adt_df['Growth Rate']
colors -= colors.min()
colors *= (1.0/colors.max())
cm = plt.cm.get_cmap('RdYlBu')
cm = plt.cm.jet
plot_color = cm(colors)

fig = plt.figure()

plt.scatter( adt_df['Diameter (Normalized)'], adt_df['Height (Normalized)'] , alpha=0.3, marker='*', c=plot_color)
plt.show()
In [24]:
fig = plt.figure(figsize=(6,4))


colors=adt_df['SexNum']
colors -= colors.min()
colors *= (1.0/colors.max())
cm = plt.cm.copper
plot_color = cm(colors)

ax1 = fig.add_subplot(111, projection='3d')

ax1.scatter(adt_df['Volume (Max Normalized)'], adt_df['Whole weight (Normalized)'] , adt_df['Rings'], c=plot_color)

ax1.set_xlabel('Volume (Max Normalized)')
ax1.set_ylabel('Whole weight (Normalized)')
ax1.set_zlabel('Num Rings')

plt.show()
In [ ]:
 
In [ ]: